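Ever-denser traffic in our local environment poses challenges and calls for better traffic monitoring and management systems. Fine-grained vehicle classification is a harder task than coarse vehicle classification, so robust methods for detecting vehicles and classifying them into fine-grained categories need to be explored. Existing vehicle make and model recognition (VMMR) systems have been developed for synchronized and controlled traffic conditions. Robust VMMR under complex, urban, heterogeneous, and unsynchronized traffic conditions remains an open research area. In this paper, vehicle detection and fine-grained classification are addressed using deep learning. To perform fine-grained classification with the associated complexities, a local dataset, THS-10, with high intra-class and low inter-class variation was prepared specifically for this purpose. The dataset consists of 4250 vehicle images of 10 vehicle models, i.e., Honda City, Honda Civic, Suzuki, Suzuki Bolan, Suzuki Cultus, Suzuki Mehran, Suzuki Ravi, Suzuki Swift, Suzuki Wagon R, and Toyota Corolla. The dataset is available online. Two approaches to vehicle classification have been explored and analyzed: fine-tuning, and feature extraction from deep neural networks. A comparative study is performed, and it is demonstrated that simpler approaches can produce good results in the local environment when dealing with complex issues such as dense occlusion and lane departures, thereby reducing computational load and time. For example, fine-tuning Inception-v3 produced the highest accuracy of 97.4% with the lowest misclassification rate of 2.08%. Fine-tuning MobileNet-v2 and ResNet-18 produced accuracies of 96.8% and 95.7%, respectively. Extracting features from the fc6 layer of AlexNet produced an accuracy of 93.5% with a misclassification rate of 6.5%.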
We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between image reconstruction fidelity and image editing quality remains an open challenge. Low-rate latent spaces are limited in their expressive power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degraded editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features to adapt to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements. Code: https://github.com/hamzapehlivan/StyleRes
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a subfield of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. As a result, the integration of vision and language has attracted considerable attention. The tasks have been designed so that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and we discuss existing datasets, methods, problem formulations, and evaluation measures for Visual Question Answering (VQA) and visual reasoning tasks, in order to understand vision and language representation learning. We also present some potential future directions in this field of research, in the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
Transformer models have achieved great success across many NLP problems. However, previous studies in automated ICD coding concluded that these models fail to outperform some of the earlier solutions such as CNN-based models. In this paper we challenge this conclusion. We present a simple and scalable method to process long text with the existing transformer models such as BERT. We show that this method significantly improves the previous results reported for transformer models in ICD coding, and is able to outperform one of the prominent CNN-based methods.
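The abstract does not spell out how long text is handled, so the following is only a minimal sketch of one common approach, not the authors' method: split the token sequence into overlapping fixed-size windows that fit a BERT-style input limit, score each window separately, and pool the per-label scores. The function names, the window size of 512, the stride of 256, and the max-pooling step are all assumptions made for illustration.

```python
def split_into_windows(token_ids, window_size=512, stride=256):
    """Split a long token sequence into overlapping fixed-size windows.

    window_size and stride are illustrative defaults (assumptions), chosen
    to match the usual 512-token input limit of BERT-style encoders.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window_size])
        # Stop once the current window reaches the end of the sequence.
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows


def aggregate_scores(per_window_scores):
    """Max-pool per-label scores across windows (an assumed pooling choice)."""
    return [max(scores) for scores in zip(*per_window_scores)]


# Toy usage: ten "tokens", windows of four tokens, stride of two.
tokens = list(range(10))
toy_windows = split_into_windows(tokens, window_size=4, stride=2)

# Hypothetical per-window scores for two labels, as a model might return.
toy_scores = aggregate_scores([[0.1, 0.9], [0.4, 0.2]])
```

Max-pooling suits a multi-label task such as ICD coding in the sense that a code mentioned in any single window can still be predicted for the whole document; mean-pooling or attention over windows would be equally plausible alternatives.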
Context-sensitive two-point layer 5 pyramidal cells (L5PCs) were discovered as long ago as 1999. However, the potential of this discovery to provide useful neural computation has yet to be demonstrated. Here we show for the first time how a transformative L5PC-driven deep neural network (DNN), termed the multisensory cooperative computing (MCC) architecture, can effectively process large amounts of heterogeneous real-world audio-visual (AV) data, using far less energy than the best available 'point' neuron-driven DNNs. A novel highly-distributed parallel implementation on a Xilinx UltraScale+ MPSoC device estimates energy savings of up to $245759 \times 50000\,\mu$J (i.e., 62% less than the baseline model in a semi-supervised learning setup), where a single synapse consumes $8 \times 10^{-5}\,\mu$J. In a supervised learning setup, the energy saving can potentially reach up to 1250x less (per feedforward transmission) than the baseline model. The significantly reduced neural activity in MCC leads to inherently fast learning and resilience against sudden neural damage. This remarkable performance in pilot experiments demonstrates the embodied neuromorphic intelligence of our proposed cooperative L5PC, which receives input from diverse neighbouring neurons as context to amplify the transmission of the most salient and relevant information for onward transmission, out of the overwhelmingly large multimodal information utilised at the early stages of on-chip training. Our proposed approach opens new cross-disciplinary avenues for future on-chip DNN training implementations and posits a radical shift in current neuromorphic computing paradigms.
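Minimal parameter setup is desirable in machine learning models to avoid time-consuming optimization processes. The $k$-Nearest Neighbors classifier is one of the most effective and straightforward models, employed in numerous problems. Despite its well-known performance, it requires a value of $k$ suited to the specific data distribution, which demands expensive computational effort. This paper proposes a $k$-Nearest Neighbors classifier that bypasses the need to define the value of $k$. The model computes the $k$ value from the data distribution of the training set. We compare the proposed model against the standard $k$-Nearest Neighbors classifier and two parameterless versions from the literature. Experiments over 11 public datasets confirm the robustness of the proposed approach, as the obtained results were similar or even better.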
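To make the role of $k$ concrete, here is a minimal sketch of the standard $k$-Nearest Neighbors classifier that serves as the baseline, written with the Python standard library only. It does not implement the proposed parameterless model, whose rule for choosing $k$ is not given in the abstract. The toy points, labels, and the name `knn_predict` are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance only and keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): one "b" point sits close to the "a" cluster.
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (1.5, 1.5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
query = (1.4, 1.4)

prediction_k1 = knn_predict(train_points, train_labels, query, k=1)
prediction_k3 = knn_predict(train_points, train_labels, query, k=3)
```

On this toy data the two calls disagree: with `k=1` the single closest point decides the label, while with `k=3` two points from the other class outvote it. This sensitivity is what usually forces a search over $k$ for each dataset.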
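This paper proposes MBURST, a novel multimodal solution for audio-visual speech enhancement that takes into account the most recent neurological findings on pyramidal cells of the prefrontal cortex and other brain regions. The so-called burst propagation implements several criteria to address the credit assignment problem in a more viable way: it steers the sign and magnitude of plasticity through feedback, and it linearizes the feedback signals. MBURST benefits from these capabilities to learn correlations between the noisy signal and the visual stimuli, thus attributing meaning to the speech by amplifying relevant information and suppressing noise. Experiments conducted over the Grid Corpus and a CHiME3-based dataset show that MBURST can produce mask reconstructions similar to those of a multimodal backpropagation-based baseline while demonstrating outstanding energy efficiency management, reducing the neuron firing rates to values up to 70% lower. Such a feature implies more sustainable implementations, suitable for hearing aids or any other similar embedded systems.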
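Emotion classification from EEG signals has seen many advances. However, problems such as the lack of data and the difficulty of learning important features and patterns have always left room for improvement, both computationally and in prediction accuracy. This work analyzes the performance of baseline machine learning classifiers on the DEAP dataset alongside a tabular learning approach that provides results comparable to the state of the art, leveraging the performance boost from its deep learning architecture without deploying heavy neural networks.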
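Widely used deep learning models have been found to have poor robustness. A small amount of noise can fool state-of-the-art models into making incorrect predictions. Although there are many high-performance attack generation methods, most of them add perturbations directly to the original data and measure them using $L_p$ norms; this can break the major structure of the data and thus produce invalid attacks. In this paper, we propose a black-box attack which, instead of modifying the original data, modifies the latent features of the data extracted by an autoencoder; we then measure the noise in semantic space to protect the semantics of the data. We trained autoencoders on the MNIST and CIFAR-10 datasets and found optimal adversarial perturbations using a genetic algorithm. Our approach achieved a 100% attack success rate on the first 100 samples of the MNIST and CIFAR-10 datasets, with smaller perturbations.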
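The search described above can be illustrated with a toy sketch, assuming a two-dimensional latent space. The `decode` and `classify` functions below are hypothetical stand-ins for a trained decoder and for the black-box victim model, and the genetic operators (keeping the better half, uniform crossover, Gaussian mutation) and the penalty constant are assumptions, not the configuration used in the paper. Fitness is measured on the latent perturbation itself, so the search looks for a small latent change that flips the prediction.

```python
import math
import random


def decode(z):
    """Hypothetical stand-in for a trained decoder: a fixed linear map."""
    return [2.0 * z[0] + z[1], z[0] - z[1]]


def classify(x):
    """Hypothetical stand-in for the black-box victim model (binary output)."""
    return 1 if x[0] + x[1] > 0 else 0


def genetic_attack(z, pop_size=40, generations=30, sigma=1.0, seed=0):
    """Search for a small latent perturbation that changes the predicted label."""
    rng = random.Random(seed)
    original_label = classify(decode(z))

    def fitness(delta):
        # Lower is better: perturbation norm, plus a large penalty
        # while the victim model still predicts the original label.
        adversarial = decode([zi + di for zi, di in zip(z, delta)])
        penalty = 0.0 if classify(adversarial) != original_label else 1000.0
        return penalty + math.hypot(*delta)

    population = [[rng.gauss(0.0, sigma) for _ in z] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        # Elitism: the better half survives unchanged into the next generation.
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by a small Gaussian mutation.
            child = [rng.choice(pair) + rng.gauss(0.0, 0.1 * sigma)
                     for pair in zip(a, b)]
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return best, fitness(best), history


latent = [0.5, 0.5]
best_delta, best_score, history = genetic_attack(latent)
adversarial_label = classify(decode([latent[0] + best_delta[0],
                                     latent[1] + best_delta[1]]))
```

Because only `classify` outputs are queried, the search needs no gradients from the victim model, which is what makes a genetic algorithm a natural fit for the black-box setting.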